Web Page Structure Enhanced Feature Selection for Classification of Web Pages
نویسندگان
چکیده
Web page classification is achieved using text classification techniques. Web page classification is different from traditional text classification due to additional information, provided by web page structure which provides much information on content importance. HTML tags provide visual web page representation and can be considered a parameter to highlight content importance. Textual keywords are base on which Information retrieval systems rely to index and retrieve documents. Keyword-based retrieval returns inaccurate/incomplete results when differing keywords describe the same document and queries concept. Conceptbased retrieval tried to tackle this by using manual thesauri with term co-occurrence data, or by extracting latent word relationships and concepts from a corpus. Semantic search motivates Semantic Web from inception for classification and retrieval processes. In this paper, a model for the exploitation of semantic-based feature selection is proposed to improve search and retrieval of web pages over large document repositories. The features are classified using Support Vector Machine (SVM) using different kernels. The experimental results show improved precision and recall with the proposed method with respect to keyword-based search.. General Terms Web Mining
منابع مشابه
A Novel Approach to Feature Selection Using PageRank algorithm for Web Page Classification
In this paper, a novel filter-based approach is proposed using the PageRank algorithm to select the optimal subset of features as well as to compute their weights for web page classification. To evaluate the proposed approach multiple experiments are performed using accuracy score as the main criterion on four different datasets, namely WebKB, Reuters-R8, Reuters-R52, and 20NewsGroups. By analy...
متن کاملA Technique for Improving Web Mining using Enhanced Genetic Algorithm
World Wide Web is growing at a very fast pace and makes a lot of information available to the public. Search engines used conventional methods to retrieve information on the Web; however, the search results of these engines are still able to be refined and their accuracy is not high enough. One of the methods for web mining is evolutionary algorithms which search according to the user interests...
متن کاملJoint Web-Feature (JFEAT): A Novel Web Page Classification Framework
With the increasing amount of web pages over the internet, it has been a major concern to obtain information on the internet accurately at a reasonable cost with decent performance. A potential solution is through the classification of web pages into meaningful categories. An effective classification of web pages is of benefit to various applications such as web mining and search engines. Unlik...
متن کاملFeature Selection with Rough Sets for Web Page Classification
Web page classification is the problem of assigning predefined categories to web pages. A challenge in web page classification is how to deal with the high dimensionality of the feature space. We present a feature reduction method based on the rough set theory and investigate the effectiveness of the rough set feature selection method on web page classification. Our experiments indicate that ro...
متن کاملAn Improved Optimized Web Page Classification using Firefly Algorithm with NB Classifier (WPCNB)
The web is a huge repository of information which needs for accurate automated classifiers for Web pages to maintain Web directories and to increase search engines‟ performance. In web page classification problem each term in each HTML/XML tag of each Web page can be taken as a feature, an efficient methods to select best features to reduce feature space of the Web page classification problem d...
متن کامل